Virtual host controller interface with multipath input/output

ABSTRACT

A processor-based system ( 200 ) with a multipath I/O architecture, including a virtual host controller interface (vHCI) layer ( 280 ) between a common architecture layer ( 270 ) and a physical host controller interface layer ( 290 ), which may include conventional host bus adapters (HBAs) coupled to target devices such as storage devices ( 240, 250 ) in a storage area network (SAN). Target drivers send I/O requests to a common architecture layer, which forwards them to the vHCI layer ( 280 ), which then sends them to HBAs for sending to the target devices ( 240, 250 ). A multipathing driver interface (MPXIO) layer ( 310 ) resides beneath the vHCI layer ( 280 ), and determines target device path information for the vHCI layer ( 280 ). Positioning the MPXIO layer ( 310 ) beneath the vHCI layer avoids the need for multipathing target drivers ( 360 ) above the common architecture layer. A failover operations module may be provided for each type of target device to provide the vHCI layer ( 280 ) with failover protocol information in the event of a failed path.

[0001] This application claims the benefit of U.S. Provisional Application No. 60/283,659, of Osher et al., filed Apr. 13, 2001, entitled “Geometrically Accurate Compression and Decompression”. Attached as Appendix A to provisional application '659 is a document entitled “Multiplexed I/O (MPXIO)”, which gives implementation details for an embodiment of the invention. Also attached to provisional application '659, as Appendices B and C, are manual pages (man pages) that would be suitable for a UNIX (or other OS) implementation of the new MPXIO architecture. The U.S. Provisional Application No. 60/257,210, with its Appendices, is incorporated herein by reference.

BACKGROUND OF THE INVENTION

[0002] This invention relates to a new system architecture providing multiple input/output (I/O) paths to client devices, such as storage devices, in a processor-based system or network.

[0003] As more systems use storage area networks (SANs), environments are created wherein multiple hosts are communicating with a given storage device. In both uniprocessor and multiprocessor settings, multiple paths are formed to the same storage device. These multiple paths can provide greater bandwidth, load balancing, and high availability (HA).

[0004] In I/O architectures currently in use, such multiple paths to storage devices may be provided as illustrated in the storage area network of FIG. 1. In this figure, a host system 10 is of conventional design, using a processor 20, memory 30, and other standard components of a computer system (such as display, user input devices, and so on). The system 10 also typically includes one or several host bus adapters (HBAs) such as HBAs 40 and 50, which communicate via switches 60 and 70 with storage devices 80 and 90, respectively. Alternatively, the storage devices may be multiported, in which case the switches may not be used.

[0005] Software layers 100 are used by the host 10, and as shown in FIG. 2, in systems currently in use a common architecture layer 110 may be provided above the HBA layer, such as applicant Sun Microsystems, Inc.'s “SCSA” (Sun Common SCSI Architecture). Above this layer are device drivers (such as applicant's “SSDs”, i.e. Sun Microsystems, Inc.'s SCSI disk drivers) 120 and 130. More specifically, these drivers 120 and 130 are in this example different instances of the same device driver.

[0006] Above the device driver layer is a metadriver (MD) 140. When the host 10 sends an I/O request to, e.g., storage device 80 (storage 90 being omitted from FIG. 2 for simplicity), the request is sent through the metadriver 140 to the drivers 120 and 130. If one of the paths to a storage device fails (e.g. path 82 or 84 to storage 80, or path 92 or 94 to storage 90), then it will be necessary to execute the I/O request via a path that has not failed.

[0007] In the case of symmetric storage devices, the paths may easily be load balanced, and failover for an I/O request is accomplished simply by using the non-failing path. For asymmetric devices, the system must be informed that the first path has failed. For instance, in FIG. 2 if a write command is sent via the metadriver 140 through driver 120 and SCSA layer 110 to HBA 40, and it turns out that path 82 to storage 80 fails, then this is communicated back up to the driver 120, which will typically execute additional tries. Each try may be very time-consuming, taking up to several minutes to execute. If path 82 has failed, this is wasted time; eventually, the driver 120 stops retrying, and the metadriver 140 will try the other path. Assuming path 84 is operational, the I/O attempt via driver 130 and HBA 50 will succeed.

[0008] In such a system, there are a number of inefficiencies, primarily including the time wasted retrying the I/O request along a failed path. A system is needed that eliminates such inefficiencies, and in particular that allows retrying of I/O requests more quickly along a working path.

[0009] Issues with Using Multiple Driver Instances

[0010] An issue that arises in connection with multipath devices is the structure of the Solaris (or other OS) device tree and the device autoconfiguration process. The OS device tree enumerates physical connections to devices; that is, a device instance is identified by its connection to its physical parent. This is in part due to the bottom-up device autoconfiguration process as well as the lack of self-enumeration support in the I/O controllers available at the time this framework was initially designed.

[0011] The presence of multiple device instances for a single device can lead to various issues. One of these is wastefulness of system resources, due to the consumption of system namespace and resources as each path to a device is assigned a unique device instance and name. Thus, as the number of HCIs to common pools of devices increases, the number of devices that can be hosted decreases. The minor number space available today for “sd” (SCSI disk) and “ssd” (which refers, e.g., to fibre channel SCSI disk device drivers) devices limits the Solaris OS to 32K single-pathed drives. Each additional path to a pool of devices decreases this by a factor of 2.

[0012] Each duplicate instance wastes kernel memory in the form of multiple data structures and driver soft states. Inodes in the root file system are also wasted on the duplicated /devices and /dev entries.

[0013] Another issue that arises is that system administrators, as well as applications, are faced with challenges when attempting to understand and manage multipath configurations in the OS. Such challenges include:

[0014] 1. prtconf(1m): Since prtconf displays the structure of the OS device tree, it lists each instance of a multipath device. There is no way currently for a system administrator to quickly determine which devices in the output are in fact the same device. Another piece of information that is lacking is the identity of the layered driver that is “covering” this device and providing failover and/or load balancing services.

[0015] 2. Lack of integration with DR (dynamic reconfiguration): DR has no way of knowing if a device is attached to multiple parent devices; it is left up to the system administrator to identify and offline all paths to a given device. Some of the layered products (e.g., DMP, or dynamic multipathing, products) actually prevent DR from occurring, as they hold the underlying devices open and do not participate in the DR and RCM (reconfiguration coordination manager) framework.

[0016] 3. Multiple names and namespaces in /dev: Each instance of a multipath disk device appears in /dev with a distinct logical controller name; the system administrator needs to be aware that a given device has multiple names, which can lead to errors during configuration or diagnosis. In addition, layered products define additional product-specific namespaces under /dev to represent their particular multipath device, e.g. /dev/ap/{r}dsk/*, /dev/dmp/{r}dsk/*, /dev/osa/{r}dsk/*, etc. Both administrators and applications need to be aware of these additional namespaces, as well as knowing that the multi-instance names in /dev may be under the control of a layered driver.

[0017] Another issue that arises due to the use of layered drivers has to do with their statefulness. The layered driver approach becomes significantly more difficult to implement once stateful drivers such as tape drivers are deployed in multipath configurations. Driver state (such as tape position) needs to be shared between the multiple instances via some protocol with the upper layered driver. This exposes an additional deficiency with using layered drivers for multipath solutions: a separate layered driver is needed for each class of driver or device that needs to be supported in these configurations.

[0018] Issues with Failover Operations

[0019] Yet another issue is that of failover/error management. Layered drivers communicate with the underlying drivers via the buf(9s) structure. The format of this structure limits the amount of error status information that can be returned by the underlying driver and thus limits the information available to the layered driver to make proper failover decisions.

[0020] In addition, the handling of failover operations by a system such as that shown in FIG. 1 can present other challenges. Switches 60 and 70 are multiport switches, providing redundant paths to storage 80 (paths 82 and 84) and storage 90 (paths 92 and 94). If path 86 to switch 60 fails, the system needs to activate path 96, which will be a different operation for storage device 80 than for storage device 90, which in general will be different types of storage devices.

[0021] An efficient way of activating paths common to different storage devices, such as when a failover operation is executed, is thus needed.

SUMMARY OF THE INVENTION

[0022] A processor-based architecture according to an embodiment of the present invention includes a virtual host controller interface (vHCI) layer which handles I/O requests to target devices. This layer is preferably beneath a common architecture layer, which is beneath the target drivers, and above the HBA or physical host controller interface (pHCI) layer. A multiplex I/O module discovers available paths to the target devices, and communicates these to the vHCI layer, which then uses the path information to transmit the I/O requests to the target devices. In the case of a failed path, the vHCI can immediately send an I/O request by an alternate path, without the need to retry or to bounce the failed I/O request back up to the driver layer. Use of the MPXIO module allows the multipathing protocol to be provided at a low level, thus avoiding the need for a multipathing target driver for each type of target used. The vHCI layer may also communicate with failover operations modules, which provide target device-specific information for each type of target, and which may be compiled separately from the vHCI to allow addition of the modules to the system without having to reboot.

[0023] Other embodiments and features are discussed below.

BRIEF DESCRIPTION OF THE DRAWINGS

[0024] FIG. 1 is a block diagram of a conventional storage area network (SAN).

[0025] FIG. 2 is a block diagram showing a layered multipath architecture.

[0026] FIG. 3 is a block diagram of a new multipathing architecture according to the invention.

[0027] FIG. 4 is a block diagram showing details of the new architecture of FIG. 3.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0028] The system of the invention provides a new architecture for representing and managing devices that are accessible through multiple host controller interfaces (HCIs) from a given instance of the operating system (OS). This type of device configuration, referred to herein as “multipath”, is capable of supporting failover semantics in the event of interconnect and controller failures, as well as balancing the I/O workload across the set of available controllers. This application describes the new multiplexed I/O (MPXIO) architecture, along with a set of changes to the core of the OS which support this architecture.

[0029] A suitable OS for an embodiment of the invention is applicant's UNIX operating system, the Solaris OS. Hereinafter, reference will be made to the “Solaris OS” or just to the OS, and it should be understood that this refers to the Solaris OS or any other suitable operating system for the invention (e.g. other UNIX operating systems such as Linux, or non-UNIX operating systems).

[0030] Modern high-performance I/O bus architectures are migrating from a host-centric model, where storage is private to a single host, towards the SAN (Storage Area Network) model, where storage is treated in a peer-to-host-computers manner and is managed as a pool of resources to be shared among multiple heterogeneous hosts via a shared I/O interconnect. Combined with this is an increase in pressure by RAS (reliability-availability-serviceability) requirements and performance metrics, as the OS pushes deeper into enterprise and HA (high availability) environments. This requires increasing degrees of availability, scalability, performance, and manageability.

[0031] Certain platforms, such as those of the Solaris OS, will attach to these SANs using multiple host controller interfaces and I/O interconnect controller interfaces (either of which types of interfaces may be referred to as HCIs), to increase both availability and I/O bandwidth to the storage pools. Some operating systems may not be optimally designed for supporting the multipath device configurations presented by these new SAN architectures. This stems from the fact that a given device that is visible through multiple host controllers is identified as separate and independent device instances by such operating systems, e.g. the Solaris OS, and its suite of management applications.

[0032] The invention thus relates to a new software architecture for managing multiported storage devices for processor-based systems. Background technology relating to this invention is described in the book Writing Device Drivers (August 1997), a publication of SunSoft, Inc., which is a subsidiary of applicant Sun Microsystems, Inc. That book is incorporated herein by reference.

[0033] The following example, taken from a system with a dual-pathed Fibre Channel A5000 storage array of applicant Sun Microsystems, Inc., illustrates this fact. Note the matching WWN (worldwide name) in the unit-address of the two ssd target devices:

[0034] /dev/dsk/c2t67d0s0 -> ../../devices/pci@6,4000/SUNW,ifp@2/ssd@w220000203709c3f5,0:a

[0035] /dev/dsk/c3t67d0s0 -> ../../devices/pci@6,4000/SUNW,ifp@3/ssd@w220000203709c3f5,0:a
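By way of illustration only, the following sketch shows how the two links quoted above can be recognized as paths to one device by comparing the unit-address of the leaf node. The function names `unit_address` and `group_by_unit_address` are hypothetical and are not part of any OS interface; the parsing assumes the `name@unit-address:minor` form seen in the example.

```python
from collections import defaultdict

# The two /dev links from the example, as (logical name, physical path) pairs.
LINKS = [
    ("/dev/dsk/c2t67d0s0",
     "../../devices/pci@6,4000/SUNW,ifp@2/ssd@w220000203709c3f5,0:a"),
    ("/dev/dsk/c3t67d0s0",
     "../../devices/pci@6,4000/SUNW,ifp@3/ssd@w220000203709c3f5,0:a"),
]


def unit_address(physical_path):
    """Return the unit-address of the leaf node (hypothetical helper)."""
    leaf = physical_path.rsplit("/", 1)[-1]   # 'ssd@w2200...,0:a'
    address = leaf.split("@", 1)[1]           # 'w2200...,0:a'
    return address.split(":", 1)[0]           # drop the minor name ':a'


def group_by_unit_address(links):
    """Map each unit-address to the logical names that reach it."""
    groups = defaultdict(list)
    for logical, physical in links:
        groups[unit_address(physical)].append(logical)
    return dict(groups)


GROUPS = group_by_unit_address(LINKS)
```

Both logical names fall under the single key `w220000203709c3f5,0`, which is the kind of information an administrator must otherwise work out by hand.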

[0036] In a current version of the Solaris OS, the operating system would not manage these multiple instances as a single device, but would leave it up to other products to do so, such products being implemented using vendor-specific layered device drivers to provide failover and load balancing support. Such products include Alternate Pathing (AP, from Sun Microsystems, Inc.), RDAC (Symbios/LSI), DMP (Veritas), and EMC PowerPath. However, each of these products suffers from a number of limitations (including bugs), which can result from poor integration with the Solaris OS and their inability to interact seamlessly with one another.

[0037] Competing OS vendors such as SCO UnixWare, Data General, and IBM's Sequent all support multipath I/O as part of their enterprise high availability storage solution. In addition, IBM's mainframe I/O architecture supports multichannel device access.

[0038] General Design Features of the New Architecture

[0039] This invention involves a new architecture for naming and managing multipath devices in the OS. This architecture eliminates the need for the multiple layered implementations available as unbundled or third party products that currently support device multipathing.

[0040] A feature of one embodiment of the new architecture is that the device tree is restructured to permit a multipath device to be represented as a single device instance in the OS, rather than having one instance per physical path, as is the case in systems presently in use. Multipath devices are attached below command-set specific pseudo bus nexus drivers, otherwise called virtual Host Controller Interface (vHCI) drivers. Here, “virtual” refers to a reconfigurable or reprogrammable structure (which may include software or firmware, but may also include hardware elements), in contrast to a conventional HBA or physical host controller interface.

[0041] vHCI drivers are provided with naming and transport services by one or more physical Host Controller Interface (pHCI) devices, which share the common command set or bus architecture such as SCSI-3.

[0042] The architecture also specifies a set of core services for path management support to be exported by vHCI drivers. vHCI drivers will also implement a subset of the DR interfaces currently defined for bus nexus drivers.

[0043] In addition, the architecture specifies a set of recommended practices for the command set-specific implementations to consider where applicable.

[0044] Specific Embodiments of the New Architecture

[0045] FIG. 3 is a block diagram showing a new architecture according to one embodiment of the invention. A host system 200, which may be a workstation, personal computer, server, or the like, includes at least one processor (though it may be a multiprocessor system) 210, memory 220, and other conventional processor system features not separately shown, including user interface and input hardware and software (keyboard, mouse, etc.), a display, and any other components useful to input or output of information and interaction with a user.

[0046] Software layers 230 reside on the system 200 and are executed by the processor 210 and by appropriate dedicated hardware. The system 200 communicates with storage devices 240 and 250. Although two storage devices are shown for this embodiment, one storage device or more than two may be used. The storage devices may be any combination of tape storage, disks, and other conventional storage hardware, using the appropriate drivers. For the description of FIG. 3, by way of example the storage devices 240 and 250 will be assumed to be the same type of device, such as two disk drives, though any combination of devices is possible.

[0047] A conventional driver 260 is used, which issues command packets for both storage devices 240 and 250. These are sent down to a common architecture layer 270, which in the case of applicant's systems may be the SCSA (Sun Common SCSI Architecture) layer, but in other systems will be an equivalent software layer.

[0048] For the purposes of the invention, the terms “packet”, “request” and “command”, and other references to I/O communications, may be taken as referring to any I/O information or the like that may be communicated along a path from a processor, a user application, or other hardware or software. It will be understood by those skilled in the art that such packets, etc. may be modified or adapted along the path to a target device, and thus the forwarding, resending, etc. of a packet, command, or the like does not mean that the forwarded item is unaltered.

[0049] A system according to the present invention includes a virtual host controller interface (vHCI) layer 280, which sends and receives I/O packets or commands between the common architecture layer 270 and a multipath driver interface (MDI) layer 310, as well as physical host controller interface (pHCI) 290 and pHCI 300. The pHCIs, which may be conventional host bus adapters (HBAs), provide the hardware interface between the software layers 230 and the storage devices 240-250.

[0050] Thus, the driver 260 creates a command packet and sends it down to the SCSA layer, which hands the packet off to the vHCI layer. The MDI layer includes a multiplexed I/O (MPXIO) module 320, which the vHCI layer consults to determine which pHCI is in use, e.g. pHCI 290 or 300.

[0051] The multipath I/O module 320 in a preferred embodiment of the invention takes the form of software associated with the MDI 310, but other embodiments are possible. In general, the architecture and functions of the present invention may be implemented as hardware, software and/or firmware in any combination appropriate to a given setting. Thus, these terms should be interpreted as interchangeable for the present invention, since where one is specified the others may be used. In particular, the terms “program”, “architecture”, “module”, “software” etc., may in practice be implemented as hardware, software, firmware or the like, as appropriate.

[0052] The MPXIO module 320 informs the vHCI layer 280 which pHCI is in use (e.g. pHCI 300 for this example), and the vHCI layer 280 accordingly hands the packet off to pHCI 300. The pHCIs 290 and 300 are responsible for the physical transport of packets over fibre channel or other network connection to their respective client devices.

[0053] FIG. 4 shows an embodiment of the invention incorporating features not represented in FIG. 3, but the description of the common features of these two figures is applicable to both. FIG. 4 illustrates a hybrid arrangement, in which both the vHCI layer 280 and MDI layer 310 are used in connection with target drivers (e.g. disk drivers) 260-264, but additional target drivers 266-268 are also used, whose I/O packets are not passed through the vHCI and pHCI layers. The target drivers may be, as with FIG. 3, disk drivers, tape drivers, and/or other combinations of appropriate device drivers. The operation of the hybrid system is discussed in detail below.

[0054] Thus, the embodiment of FIG. 4 allows the use of prior, conventional architecture in combination with an architecture of the present invention, allowing for flexibility in the system's configuration when an MDI layer 310 specific to one or more of the client devices 330 has not yet been created.

[0055] The embodiment of FIG. 4 also shows a user application (for instance, a database application) 340 and a management application 350 using a library 360 called “libdevinfo”, which provides user programs with access to read-only information about the device tree. The libdevinfo library exports device nodes, minor nodes, and device properties in the form of a consistent “snapshot” of kernel state. Internally, libdevinfo interacts with the devinfo driver 370, which gathers state information about the device tree into a buffer for consumption by the libdevinfo library 360. See Appendix A, Section 5 for a more complete description of how a conventional libdevinfo library may be modified and used in conjunction with an embodiment of the present invention.

[0056] Packet Flow According to the Invention

[0057] The general flow of an I/O request taken from the perspective of a client driver is as follows:

[0058] 1. Allocate a command packet for use by the driver to construct the I/O request. This may result in a call into the device's parent nexus driver to allocate (HCI) resources for the command packet, e.g. scsi_init_pkt(9f).

[0059] 2. The driver prepares any data to be transmitted and initializes the command packet to describe the specific I/O request, e.g. scsi_setup_cdb(9f).

[0060] 3. The driver submits the I/O request packet to the framework, which attempts to start or queue the request at the device's parent HCI, e.g. scsi_transport(9f).

[0061] 4. The driver's interrupt handler or command completion callback function is invoked by the framework with a success or failure code for the I/O request. If the request is completed in error, the driver may fetch additional error status and choose to retry or fail the request.

[0062] This model lends itself well to disassociating multipath devices from specific paths, since the decision of which pHCI device transports the I/O request is left to the framework, and is not known by the client driver making the request for transport services.
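By way of illustration only, the four steps above may be modeled as follows. This is a simplified sketch and not the DDI itself: the `Packet` and `Framework` classes and the `client_read` function are hypothetical stand-ins for the roles played by scsi_init_pkt(9f), scsi_setup_cdb(9f), and scsi_transport(9f), and the path names are arbitrary labels.

```python
class Packet:
    """Hypothetical command packet."""

    def __init__(self, callback):
        self.cdb = None            # command description, filled in by the driver
        self.callback = callback   # completion callback supplied by the driver
        self.path_used = None      # recorded by the framework, not the driver


class Framework:
    """Hypothetical transport framework; it alone chooses the path."""

    def __init__(self, paths):
        self.paths = paths         # path name -> True if the path is usable

    def init_pkt(self, callback):  # step 1: allocate the packet
        return Packet(callback)

    def transport(self, pkt):      # step 3: start the request on some path
        for name, usable in self.paths.items():
            if usable:
                pkt.path_used = name
                pkt.callback(pkt, True)    # step 4: completion with success
                return True
        pkt.callback(pkt, False)           # step 4: completion with failure
        return False


def client_read(framework, block):
    """Client driver code: note that it never names a path."""
    results = []
    pkt = framework.init_pkt(lambda p, ok: results.append(ok))
    pkt.cdb = ("READ", block)              # step 2: describe the request
    framework.transport(pkt)
    return results[0], pkt.path_used
```

In this sketch the client code is identical whether the request travels over the first or the second path, which is the property that allows a multipath device to be presented as a single instance.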

[0063] Implementation Concepts of the Invention

[0064] This section discusses concepts of the invention that may be implemented in a UNIX or other OS setting.

[0065] 1. vHCI Drivers

[0066] The vHCI drivers of the invention are pseudo nexus drivers which implement multipath services for a specific command set or bus architecture. There is a single instance of a vHCI driver for each command set that supports multipath devices. The framework creates the instance whenever an MPXIO-compliant pHCI driver registers its command set transport services with the framework.

[0067] An instance of a vHCI driver preferably provides the following services or capabilities:

[0068] Single-instance multipath devices

[0069] Multipath configuration management

[0070] I/O request routing and policy-based load balancing

[0071] Path failover support

[0072] pHCI naming service interfaces

[0073] pHCI transport service interfaces

[0074] A. Single-Instance Multipath Devices

[0075] The vHCI nexus provides a path-independent bus nexus under which multipath client devices can be attached. Client devices are created as they are registered with the framework by the pHCI devices.

[0076] This provides a path-independent name in /dev and /devices, and also eliminates the need for layered drivers to recombine multiple device instances back into a pseudo-single instance device. Eliminating multiple path-dependent names to a device provides a number of additional side benefits:

[0077] Elimination of the “sliding controller” problem. The logical controller number for clients of the “vHCI” driver instance will remain constant since the vHCI driver will never move.

[0078] Eliminating the need for device renaming upon HCI replacement. Certain HCI devices are named using some form of GUID such as a WWN (worldwide name). If a device is replaced, the OS will treat any devices attached to it as new devices, since the physical pathname to the device has changed. System administrators currently are forced to hand-edit sensitive system configuration files and reboot, in the hope that the devices will return to their original names. This runs the risk of leaving the system unable to boot. Such a naming scheme may thus impose a naming restriction on pHCI drivers, namely that:

[0079] (a) pHCI drivers are required to support self-enumeration of child devices; and

[0080] (b) pHCI drivers must be capable of generating a unique identifier (GUID) for a device prior to instantiating the device into the OS.

[0081] The present invention delivers an implementation of a vHCI driver for SCSI-3 Fibre Channel devices. An appropriate name of the node in the OS device tree would be:

[0082] /devices/scsi_vhci

[0083] with client (target) devices having names of the form:

[0084] /devices/scsi_vhci/ssd@w220000203709c3f5,0:a

[0085] B. Multipath Configuration Management

[0086] With this architecture, the mapping of available paths to client devices is automatically discovered and managed by the framework as part of the client device enumeration and registration process undertaken by the pHCI devices. This eliminates the need for static configuration databases, which typically contain data that could easily change in future hardware configurations, which will be accommodated by the present invention.

[0087] The vHCI driver is also expected to supply interfaces to user-level system management applications for querying and managing the pathset configurations being maintained by an instance of a vHCI.

[0088] The vHCI query interfaces return the following types of information:

[0089] 1. The list of pHCI devices providing transport services to the vHCI layer

[0090] 2. The list of pathsets maintained by the vHCI layer

[0091] 3. The list of client devices being maintained by the vHCI layer

[0092] 4. pHCI-specific information:

[0093] a. The list of attributes assigned to a pHCI device

[0094] b. The list of pathsets a given pHCI device is configured into

[0095] 5. Pathset-specific information:

[0096] a. The list of attributes assigned to a pathset

[0097] b. The list of pHCI devices configured into a pathset

[0098] 6. Client device-specific information

[0099] a. The default pathset for the device

[0100] b. The list of pathsets from which a device is accessible

[0101] c. The list of pHCI interfaces from which a device is accessible

[0102] d. The list of attributes assigned to the client device

[0103] The vHCI path management interfaces support the following:

[0104] 1. Autocreation of default pathsets as client and pHCI devices assemble;

[0105] 2. Dynamic creation of pathsets;

[0106] 3. Assigning of pHCI devices into specific pathsets;

[0107] 4. Assigning the default pathset ID for client devices;

[0108] 5. Removal of pHCI and client devices from existing pathsets;

[0109] 6. Setting the default pathset for specific client devices; and

[0110] 7. Setting attributes for a specific pathset, pHCI, or client device.
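By way of illustration only, a small subset of the query and path management interfaces listed above might be modeled as follows. The class `PathsetManager` and all of its method names are hypothetical; the actual interfaces are not specified in this description, and attributes are omitted for brevity.

```python
class PathsetManager:
    """Hypothetical model of pathset bookkeeping inside a vHCI instance."""

    def __init__(self):
        self.pathsets = {}          # pathset id -> set of pHCI names
        self.default_pathset = {}   # client GUID -> pathset id

    # --- management interfaces ---
    def create_pathset(self, pathset_id):
        self.pathsets.setdefault(pathset_id, set())

    def assign_phci(self, pathset_id, phci):
        self.pathsets[pathset_id].add(phci)

    def remove_phci(self, pathset_id, phci):
        self.pathsets[pathset_id].discard(phci)

    def set_default_pathset(self, client, pathset_id):
        if pathset_id not in self.pathsets:
            raise KeyError(pathset_id)
        self.default_pathset[client] = pathset_id

    # --- query interfaces ---
    def pathsets_of_phci(self, phci):
        """The list of pathsets a given pHCI device is configured into."""
        return sorted(p for p, members in self.pathsets.items() if phci in members)

    def phcis_for_client(self, client):
        """The pHCI devices in the client's default pathset."""
        return sorted(self.pathsets[self.default_pathset[client]])
```

For example, after creating a pathset, assigning two pHCI devices to it, and making it the default for a client, `phcis_for_client` returns both pHCI names for that client.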

[0111] C. I/O Request Routing and Policy-Based Load Balancing

[0112] The vHCI driver has the responsibility to select and route I/O requests from client devices attached beneath it to the “best” pHCI device that is providing transport services to the device. This routing decision considers both the default pathset assigned to a client device request, as well as any routing policy such as round robin or least busy which has been assigned to the pathset or client device.
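By way of illustration only, the two routing policies named above may be sketched as follows. The function names and the representation of load as a count of outstanding requests per pHCI are assumptions made for this example; the description does not define how "least busy" is measured.

```python
import itertools


def round_robin(phcis):
    """Return a chooser that cycles through the pHCIs in order."""
    cycle = itertools.cycle(phcis)
    return lambda outstanding: next(cycle)


def least_busy(phcis):
    """Return a chooser that picks the pHCI with the fewest outstanding requests."""
    return lambda outstanding: min(phcis, key=lambda p: outstanding.get(p, 0))


def route(requests, choose, outstanding=None):
    """Route a number of requests; return the pHCI chosen for each one."""
    load = dict(outstanding or {})     # copy so the caller's dict is untouched
    chosen = []
    for _ in range(requests):
        phci = choose(load)
        load[phci] = load.get(phci, 0) + 1
        chosen.append(phci)
    return chosen
```

With two pHCIs, the round robin chooser alternates between them, while the least busy chooser keeps selecting the idle pHCI for as long as the other has more requests outstanding.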

[0113] D. Automatic and Manual Failover

[0114] The vHCI and pHCI drivers are responsible for managing failover, which is an important feature provided by this framework. Both automatic (e.g., a cable is unintentionally disconnected) and manual (e.g., a system administrator dynamically reconfigures a system board containing a pHCI) failover semantics are needed to be compatible with the support provided by the existing layered products.

[0115] If an interconnect or device error is noted by a pHCI driver instance, the vHCI layer is notified of the loss of transport service provided by the pHCI.

[0116] In a conventional system as illustrated in FIG. 2, once the metadriver 140 submits an I/O request to the driver layer, it has no control over that request until the driver gives up (e.g. in the event of a path failure). The driver 120 or 130 has no information about a multipathing layer above it, so when an error is encountered, the driver merely retries until a timeout or a predetermined number of retries has occurred, which can take several minutes for each retry.

[0117] Since the vHCI layer in the inventive design of FIGS. 3-4 is above the pHCI layer and below the common architecture layer (and in particular, below the target driver layer), any I/O request that comes back uncompleted is retried from the vHCI layer, which has information about other available paths because the multipathing driver interface is on the same level as the vHCI. As a result, futile retries can be avoided, because the level that detects the failed path is the same as the level that has information about alternative paths, unlike in previous systems.

[0118] In the example discussed above for prior systems, where the driver retries some number of times (e.g. twice) before sending a failure message up to the metadriver layer, in the present invention the vHCI can immediately (after a single failure) fail over to another path. Thus, the new system requires only two tries (one failed and one successful) to complete the I/O request, rather than four tries for the example given for prior systems, resulting in a significant time savings.
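By way of illustration only, the try counts in this comparison can be reproduced with the following sketch. The function names are hypothetical, and each path is modeled simply as working (True) or failed (False).

```python
def tries_conventional(paths_ok, retries_per_path):
    """Tries needed when the target driver exhausts its retries on a failed
    path before the metadriver moves on to the next path."""
    tries = 0
    for ok in paths_ok:
        if ok:
            return tries + 1               # the successful try
        tries += 1 + retries_per_path      # first attempt plus futile retries
    return tries


def tries_vhci(paths_ok):
    """Tries needed when the vHCI fails over after a single failure."""
    tries = 0
    for ok in paths_ok:
        tries += 1
        if ok:
            break
    return tries
```

For one failed path followed by one working path, with two retries per path, `tries_conventional([False, True], 2)` gives four tries and `tries_vhci([False, True])` gives two, matching the figures stated above. Since each futile try may take minutes, the difference in elapsed time can be substantial.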

[0119] This points up an advantage of the new architecture: in a system as shown in FIG. 2, the disk (or tape) drivers must be multipathing, i.e., for each device type, a multipathing driver for that particular device type is needed. In the present invention as shown in FIG. 3, by way of contrast, the multipathing is handled at the vHCI layer, and the device-specific issues are handled at the target driver layer, so the multipathing module or layer 310 does not need to be programmed to handle the device-specific issues.

[0120] As a result, once a device driver is created, there are no additional issues involved in placing it in a multipathing setting. The vHCI is preferably written from the beginning to accommodate various device formats (disks, tapes, etc.), and all the device-specific actions (retries, error recoveries, etc.) happen at the target driver level. This isolation of the multipathing functionality at the vHCI level avoids the need for duplicating multipathing intelligence at different metadriver levels and integrating the multipathing into many different drivers. In addition to avoiding the need for a great deal of duplicative programming (for the different device types), it greatly reduces the number of different types of drivers needed.

[0121] A failover operation in the embodiment of FIGS. 3-4 proceeds as follows. When a given path such as path 292 (to bus 296, which connects to devices 330) fails, another path (e.g. path 294) is needed. It would be possible to code all the different device-specific information at the vHCI layer 280, but this would mean that any new device type that is added would require modification of the vHCI layer. Thus, preferably a set of one or more failover ops (operations) modules 272-276 is created, one for each type of storage device.

[0122] When the vHCI 280 needs to activate a path, it accesses the appropriate failover ops module (e.g. module 272) and sends an “activate” command. The module 272 then connects to the appropriate HBA (pHCI) driver with the correct protocol.

[0123] This modular approach allows new device types to be added merely by adding a new failover ops module, and otherwise leaving the vHCI layer unchanged. The vHCI and all of the failover ops modules can be compiled into a single driver at boot-up, or the failover ops modules may be compiled separately. In the latter case, it is possible to hot-plug a new device into the system and provide its failover ops module for real-time device discovery and operation. If the vHCI is regarded as a standard interface, then different companies' devices can simply connect to that interface, and a heterogeneous storage environment is created with automatic failover capability.
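
A minimal sketch of this modular arrangement is given below, assuming a simple registry keyed by device type. The registry `FAILOVER_OPS`, the functions `register_failover_ops` and `vhci_activate`, and the device-type strings are hypothetical names chosen for illustration; the source does not specify the module interface.

```python
FAILOVER_OPS = {}  # device type -> activate callable (hypothetical registry)


def register_failover_ops(device_type, activate):
    """Add or replace the failover ops module for one device type."""
    FAILOVER_OPS[device_type] = activate


def vhci_activate(device_type, path):
    """Generic vHCI logic: look up the module and send it "activate"."""
    if device_type not in FAILOVER_OPS:
        raise LookupError(f"no failover ops module for {device_type!r}")
    return FAILOVER_OPS[device_type](path)


def disk_array_activate(path):
    # Device-specific protocol details would live here, not in the vHCI.
    return f"disk-array protocol: activated {path}"


register_failover_ops("disk-array", disk_array_activate)
# A new device type is supported by registering one more module;
# vhci_activate itself is not modified.
register_failover_ops("tape", lambda path: f"tape protocol: activated {path}")

print(vhci_activate("disk-array", "path-294"))
```

In this model, supporting a new device type touches only the registry, which mirrors the point that the vHCI layer itself stays unchanged.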

[0124] 2. vHCI-pHCI Driver Interface

[0125] Since the vHCI and pHCI drivers implement to a common command set and bus protocol (such as SCSI-3), the interface between the two drivers is specific to the implementation. In a SCSI-3 implementation, both the vHCI driver and pHCI drivers are implemented in the model of a SCSI HBA.

[0126] 3. pHCI Driver Changes

[0127] The physical HCI drivers are changed only moderately by this architecture; the most significant change is that of bus enumeration.

[0128] Device enumeration: instead of attaching identified child devices to the individual pHCI device instances, the pHCI drivers will call mdi_devi_identify(9m) to notify the framework of identity and visibility of the device from the particular pHCI instance. The framework will either a) create a new instance for the device under the vHCI layer if it does not already exist or b) register the pHCI device as an available transport for the device.
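
The create-or-register decision can be modeled with a small table keyed by a unique device identifier. The sketch below is a toy model only: the `Framework` class and its `identify` method are hypothetical and do not reproduce the signature of mdi_devi_identify(9m); the GUID strings and pHCI names are likewise invented for the example.

```python
class Framework:
    """Toy model of the framework's single-instance device table."""

    def __init__(self):
        self.devices = {}  # device GUID -> list of pHCI instances that see it

    def identify(self, guid, phci):
        """Called once per (device, pHCI) pair found during enumeration."""
        if guid not in self.devices:
            # a) first sighting: create the single instance under the vHCI
            self.devices[guid] = [phci]
            return "created"
        if phci not in self.devices[guid]:
            # b) device already known: record one more available transport
            self.devices[guid].append(phci)
        return "registered"


fw = Framework()
print(fw.identify("guid-0001", "phci0"))
print(fw.identify("guid-0001", "phci1"))
print(fw.devices)
```

However many pHCI instances report the device, the table keeps exactly one entry per device with a growing list of transports.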

[0129] A pHCI driver is expected to support the bus_config(9b) and bus_unconfig(9b) nexus driver busop entry points. The vHCI driver will invoke these entry points to manually drive enumeration of specifically named devices.

[0130] 4. Paths and Pathsets

[0131] Another feature of the proposed architecture is the addition of the notion of paths and pathsets as manageable objects in the OS.

[0132] A path may be defined as a software representation of a hardware device which is providing device identification and transport services for a command set implementing this architecture. A path may have attributes assigned which describe the capabilities of the path to the vHCI driver implementation.

[0133] Pathsets, as the name suggests, are aggregations of paths, and are a natural addition to the processor set model already in the OS.

[0134] The framework defines a number of default pathsets to define aggregations such as “all-available-paths”. The framework also supports creation and management of pathsets by applications. System administrators could use this feature to bind specific pHCI devices to a pool of database storage devices to isolate database traffic from the effects of other users of the system.

[0135] The vHCI driver checks the pathset information assigned to the packet; if none has been defined at packet level, the vHCI driver uses the default pathset that is defined for the device.
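
The packet-level check with a per-device fallback can be sketched as below. The `PATHSETS` table, the `database-pool` pathset, the pHCI names, and the dictionary representation of a packet are all assumptions made for illustration; only the “all-available-paths” name comes from the description above.

```python
# Hypothetical pathset table: pathset name -> pHCI devices in that set.
PATHSETS = {
    "all-available-paths": ["phci0", "phci1", "phci2"],
    "database-pool": ["phci2"],  # e.g. reserved for database traffic
}


def paths_for_packet(packet, device_default="all-available-paths"):
    """Use the packet's pathset if one is set, else the device default."""
    name = packet.get("pathset")
    if name is None:
        name = device_default
    return PATHSETS[name]


print(paths_for_packet({"cmd": "read"}))
print(paths_for_packet({"cmd": "read", "pathset": "database-pool"}))
```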

[0136] In a system according to the foregoing description, the vHCI layer manages multiple pHCIs, resulting in several important advantages, including:

[0137] 1. It simplifies device naming. The system now only sees a single SSD device name for each SSD device. (SSD refers, for example, to a fibre channel SCSI disk device driver.)

[0138] 2. It provides a load balancing mechanism. Since there are multiple paths to the target devices (accessed through different ports), the system can implement a load balancing mechanism to access different devices by these different paths, as desired.

[0139] 3. It provides a failover mechanism. Target devices with multiple ports (e.g. a disk drive with two ports) may be asymmetric, i.e. the target device can be accessed through only one port at a time. One port is thus active, and the other (in the case of two ports) is passive, or inactive.

[0140] If the active port is down, i.e. is not functioning for some reason, the pHCI notifies the vHCI layer, as well as the MPXIO layer, and the vHCI layer initiates a failover to the inactive port, making it active.
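
The active/passive behavior of an asymmetric device can be modeled as a small state machine. This is a sketch under stated assumptions: the `AsymmetricDevice` class, its `port_down` method and the port names are hypothetical, and the notification from the pHCI is reduced to a direct method call.

```python
class AsymmetricDevice:
    """Toy model of a multi-port device reachable through one port at a time."""

    def __init__(self, ports):
        if not ports:
            raise ValueError("a device needs at least one port")
        self.ports = list(ports)
        self.active = self.ports[0]  # the remaining ports start out passive
        self.down = set()

    def port_down(self, port):
        """Record a failed port; fail over if it was the active one."""
        self.down.add(port)
        if port != self.active:
            return self.active  # a passive port failed, nothing to switch
        for candidate in self.ports:
            if candidate not in self.down:
                self.active = candidate  # the passive port becomes active
                return candidate
        self.active = None  # no usable port is left
        return None


dev = AsymmetricDevice(["port-a", "port-b"])
print(dev.port_down("port-a"))
```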

[0141] Features of Various Embodiments of the Invention

[0142] IP multipathing, which enables link aggregation and failover for systems with multiple network interfaces, is an appropriate setting for an embodiment of this invention, providing equivalent functionality for multipath storage devices.

[0143] Following are other features in connection with which the present invention can be implemented:

[0144] Modify the core OS to support MPXIO devices, including support for booting, DR, and power management.

[0145] Define a generic scheme for representing single instance MPXIO devices within the OS.

[0146] Enable multipath device configurations to dynamically self-assemble during boot and dynamic reconfiguration, not relying upon on-disk configuration databases to describe the multipath configuration.

[0147] Define a common architecture for I/O path management in the OS.

[0148] Define a set of requirements to be implemented by the MPXIO-compliant target and HCI drivers (properties and behavior).

[0149] Support automatic failover to route I/O requests through alternate active paths on transport failures.

[0150] Support manual switchover to enable dynamic reconfiguration.

[0151] Provide tunable load balancing for improved I/O performance. An initial implementation will include simple Preferred path (Priority scheme) and Round Robin load balancing schemes. Other implementations may include such schemes as Least I/Os per path and Least blocks per path.

[0152] Integrate with other multipathing solutions.
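
Three of the load balancing schemes named in the list above (Preferred path, Round Robin, and Least I/Os per path) can be sketched as simple selection functions. The function names, the priority convention (lower number preferred) and the per-path counters are assumptions for illustration, not details given in the source.

```python
import itertools


def preferred_path(paths, priority):
    """Priority scheme: pick the path with the lowest priority number."""
    return min(paths, key=lambda p: priority[p])


def round_robin(paths):
    """Round Robin: return an iterator that cycles through the paths."""
    return itertools.cycle(paths)


def least_ios(paths, outstanding):
    """Least I/Os per path: pick the path with the fewest pending requests."""
    return min(paths, key=lambda p: outstanding.get(p, 0))


paths = ["phci0", "phci1"]
rr = round_robin(paths)
print([next(rr) for _ in range(3)])
print(preferred_path(paths, {"phci0": 2, "phci1": 1}))
print(least_ios(paths, {"phci0": 5, "phci1": 3}))
```

A Least blocks per path scheme would follow the same shape as `least_ios`, with a counter of queued blocks in place of the request count.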

[0153] This architecture is suitable for an environment in which the client devices for a given command set can be uniquely identified using some form of GUID prior to configuring the device into the OS device tree.

[0154] Data Security

[0155] In a conventional system such as in FIG. 2, multiple drivers 120, 130, etc. are used. For multipath I/O, an I/O request should go through the metadriver layer 140, and thence through a driver to the common architecture layer 110, and through an HBA to the storage device. However, it is possible for an application to write directly to a driver (which will be identifiable through a UNIX “format” command), bypassing the metadriver layer, while another application may be writing via the metadriver, resulting in data corruption.

[0156] Since the present invention places the multipathing layer below the target driver layer, this bypass is closed off. All I/O requests to a storage device in FIG. 3 or 4 must pass through the driver (260-264) ultimately to the vHCI layer, which handles the multipathing. Since there is only one entry point, there is no opportunity for a user to write an application that bypasses the multipathing driver interface layer 310.

[0157] Hybrid System Operation: FIG. 4

[0158] In FIG. 4, paths 284 and 286 connect the vHCI layer 280 to the MDI layer 310, which in turn connects via paths 312 and 314 to the pHCIs 290 and 300, respectively. In addition, direct paths 282 and 288 connect the vHCI layer 280 directly to the pHCIs 290 and 300, i.e. without passing through the MDI layer 310 (though, depending upon the embodiment, there may be other hardware or software on these otherwise direct paths).

[0159] At boot-up or at other selected times (e.g. when a device is hot-plugged into the system), the pHCIs execute a device enumeration or discovery operation, determining the number, nature and paths of the various devices on the system. Device discovery itself can be done in a conventional manner. The device and path information is stored at the MDI level, which is preferably provided with a database or table for this purpose.

[0160] When discovery is complete, an I/O request coming down from a target driver via the common architecture layer 270 is sent to the vHCI layer 280. The vHCI provides the requested device information to the MDI, and the MDI, which has the information about paths to the devices, selects and sends back information about an available path to the vHCI.

[0161] The vHCI has the information about the pHCIs, so for the given pHCI it retrieves a “handle” (a path_info_node), which includes data structures used to communicate directly to the given pHCI. For instance, if this pHCI is pHCI 290, then the vHCI uses the direct path 282.
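
The request flow just described (the vHCI asks the MDI for a path, then uses its own handle for the chosen pHCI) can be sketched as follows. The `MDI` and `VHCI` classes, their methods, and the use of a plain callable to stand in for a path_info_node handle are assumptions for illustration; the path selection policy is reduced to "first known path".

```python
class MDI:
    """Toy path table populated at discovery time."""

    def __init__(self):
        self.paths = {}  # device name -> list of pHCI names that reach it

    def add_path(self, device, phci):
        self.paths.setdefault(device, []).append(phci)

    def select_path(self, device):
        candidates = self.paths.get(device)
        if not candidates:
            raise LookupError(f"no path known for {device!r}")
        return candidates[0]  # simplest possible selection policy


class VHCI:
    def __init__(self, mdi, handles):
        self.mdi = mdi
        # pHCI name -> callable; stands in for the path_info_node handle
        self.handles = handles

    def submit(self, device, packet):
        phci = self.mdi.select_path(device)  # ask the MDI which path to use
        transport = self.handles[phci]       # retrieve the handle for it
        return transport(packet)             # send directly to that pHCI


mdi = MDI()
mdi.add_path("disk-240", "pHCI-290")
mdi.add_path("disk-240", "pHCI-300")
vhci = VHCI(mdi, {
    "pHCI-290": lambda pkt: f"pHCI-290 sent {pkt}",
    "pHCI-300": lambda pkt: f"pHCI-300 sent {pkt}",
})
print(vhci.submit("disk-240", "read"))
```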

[0162] Each pHCI (or HBA) thus implements a set of interfaces that are defined by the common architecture (e.g. SCSA), which define the methods needed, e.g. for transporting a packet.

[0163] In the system shown in FIG. 4, the common architecture layer 270 can also communicate directly to one or more pHCIs (here, pHCI 300). Here, “directly” means without passing through a virtual host controller interface layer, though there may be other software or hardware elements along the path from the common architecture layer to a given pHCI. To the common architecture layer, the vHCI appears as simply another HBA, so when an I/O request comes from target driver 266 or 268, the common architecture layer treats it in the same manner as a request from a target driver 260-264, though in one case the request goes to the pHCI 300 and in the other it goes to the vHCI 280.
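
The hybrid arrangement can be illustrated by giving the vHCI and the pHCIs the same transport interface, so that the common architecture layer does not need to tell them apart. The classes `PHCI`, `VHCI` and `CommonArchitectureLayer`, the `transport` and `bind` methods, and the driver name strings are hypothetical; path selection inside the vHCI is deliberately reduced to taking the first pHCI.

```python
class PHCI:
    """A physical HBA with a transport method."""

    def __init__(self, name):
        self.name = name

    def transport(self, packet):
        return f"{self.name} sent {packet}"


class VHCI:
    """Looks like one HBA from above, forwards to one of several pHCIs."""

    def __init__(self, phcis):
        self.phcis = list(phcis)

    def transport(self, packet):
        return self.phcis[0].transport(packet)  # path choice simplified


class CommonArchitectureLayer:
    def __init__(self):
        self.bindings = {}  # target driver name -> HBA-like object

    def bind(self, target_driver, hba):
        self.bindings[target_driver] = hba

    def submit(self, target_driver, packet):
        # Same call whether the bound object is a vHCI or a pHCI.
        return self.bindings[target_driver].transport(packet)


phci_290, phci_300 = PHCI("pHCI-290"), PHCI("pHCI-300")
cal = CommonArchitectureLayer()
cal.bind("driver-260", VHCI([phci_290, phci_300]))  # multipathed route
cal.bind("driver-266", phci_300)                    # direct route
print(cal.submit("driver-260", "read"))
print(cal.submit("driver-266", "read"))
```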

[0164] Thus, adding the MDI layer 310 allows the vHCI layer 280 to manage I/O requests to all connected devices 330, by effectively acting as a single virtual HBA (from the point of view of the common architecture layer), but in fact communicating with multiple HBAs (pHCIs).

What is claimed is:
 1. A multipathing subsystem in a computer system having a system interface layer and at least one physical host controller interface (pHCI) in a pHCI layer, including: a virtual host controller interface (vHCI) coupled to and interacting with the system interface layer; wherein the vHCI layer interacts also with a multipath I/O module and is configured to transport target driver command packets to an underlying pHCI.
 2. The subsystem of claim 1, where the system includes multiple pHCIs, which provide a physical transport mechanism to transport command packets to target devices.
 3. A processor-based system, including: a processor configured to generate input/output (I/O) requests; a memory in communication with the processor and configured to store program instructions; at least one target driver module stored in the memory and configured to generate instructions specific to at least one target device in communication with the system, the target device being coupled to the system by at least a first path and a second path; a virtual host controller interface in communication with the target driver module; a multipath driver module in communication with the virtual host controller interface; and at least one physical interface configured to couple the device; wherein the multipath driver module is configured to generate path information relating to paths to each target device and provide the path information to the virtual host controller interface, and the virtual host controller interface is configured to pass I/O requests to a physical interface associated with a selected target device.
 4. The system of claim 3, wherein the multipath driver module is configured to determine each of a plurality of paths to the target device at a predetermined time.
 5. The system of claim 4, where the predetermined time is at boot-up of the system.
 6. The system of claim 4, wherein the predetermined time is at a user-selected time after boot-up of the system.
 7. The system of claim 3, wherein the virtual host controller interface is further configured, in the event of detection of a failed path, to resend an I/O request to the selected target device by way of another path.
 8. The system of claim 7, further including a common architecture layer between the target driver and the virtual host controller interface, the common architecture layer configured to generate instructions adapted for execution by the at least one physical interface.
 9. The system of claim 8, wherein the common architecture layer is configured to communicate with at least a first target device through the virtual host controller interface and at least a first physical interface, and is coupled directly to at least a second physical interface to communicate with at least a second target device.
 10. The system of claim 1, including a plurality of target driver modules configured to communicate with a plurality of physical interfaces, wherein at least a first target driver module is coupled to a first physical interface through the virtual host controller interface, and at least a second target driver module is coupled directly to a second physical interface.
 11. The system of claim 1, further including at least one failover operations module configured to provide a failover protocol relating to a first target device to the virtual host controller interface.
 12. The system of claim 11, wherein the failover operations module is configured to compile with the virtual host controller interface.
 13. The system of claim 11, wherein the failover operations module is configured to compile separately from the virtual host controller interface.
 14. The system of claim 13, configured to receive a second failover operations module without recompiling the virtual host controller interface.
 15. A method for multipath input/output (I/O) communication with at least one target device coupled to a processor-based system, including the steps of: sending a first I/O packet from a first target driver to a virtual host controller interface; forwarding the I/O packet from the virtual host controller interface to a first physical interface; and transmitting the I/O packet from the first physical interface to the target device; wherein the virtual host controller interface includes information relating to a plurality of paths to the target device.
 16. The method of claim 15, further including, before the sending step, the steps of: locating a number of target devices coupled to the system; and for each target device, determining at least one path to that device.
 17. The method of claim 16, further including, in the event of a failure of a path between the virtual host controller interface and a first target device, the steps of: selecting an alternative path to the first target device; and reforwarding the I/O packet from the virtual host controller interface.
 18. The method of claim 16, wherein the locating and determining steps are carried out at least at a time of boot-up of the system.
 19. The method of claim 16, wherein the locating and determining steps are carried out at least at a time other than boot-up of the system.
 20. The method of claim 16, further including the steps of: providing failover information to the virtual host controller interface relating to a first target device; and in the event of a failed path to the first target device, reforwarding the I/O packet from the virtual host controller interface, using the failover information to access another path to the first target device.